> Karpathy Autoresearch — Deep Research Report

Budding
planted Mar 8, 2026 · tended May 4, 2026
#research #ai #agents #autoresearch #karpathy #ollama #gpu #ml-training

Karpathy's Autoresearch & Autonomous AI Agents on Local GPUs

Deep research on autonomous AI agents running ML experiments with one GPU per agent — architecture, multi-agent patterns, security crises, and practical replication on consumer hardware.

Executive summary

  • Autoresearch is Karpathy's open-source framework where an AI agent autonomously runs ~100 LLM training experiments overnight on a single GPU, modifying code in 5-minute cycles and keeping improvements via git commits.
  • Multi-agent research (8 agents, 1 GPU each) showed parallelism works but scientific judgment fails — agents can implement ideas but can't design good experiments. Karpathy's next vision is SETI@home-style distributed agent collaboration.
  • My server (3× GTX 1660 Ti 6GB + 1× GTX 1080 8GB) can run autoresearch training at reduced batch sizes (depth 12 = GPT-1 size), and each GPU can independently serve 1.5B–8B inference models at 16–40 tokens/second via Ollama. (See Autonomous Agent Arena for the inference build.)
  • Security is critical — the OpenClaw crisis (800+ malicious skills, 42K exposed instances, 1-click RCE) shows that local agent frameworks need containerization, auth, and no untrusted plugins.

1. Autoresearch — architecture and how it works

Autoresearch — released March 7, 2026. Self-contained repo distilling nanochat into a single-GPU, single-file experiment platform for autonomous AI agents.

Three files, one loop

| File | Role | Editable? |
|------|------|-----------|
| program.md | Human-written agent instructions | Human only |
| train.py | Model architecture, optimizer, training loop (~630 lines) | Agent only |
| results.tsv | Experiment log | Append-only |

The agent's loop:

  1. Read the latest results.tsv to see what's been tried.
  2. Modify train.py — change architecture, hyperparameters, optimizer.
  3. Run training for a fixed 5-minute window.
  4. Decide keep (commit) or revert (discard) based on validation bits-per-byte (val_bpb).
  5. Log to results.tsv: commit hash, val_bpb, VRAM usage, status, description.
  6. Repeat indefinitely.

Yields ~12 experiments/hour, ~100 overnight. Fixed time budget makes experiments comparable regardless of what changes. Agent works on git branch autoresearch/<tag>.
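The keep-or-revert gate in steps 4–5 can be sketched as a minimal Python loop. Everything here is a stand-in: run_experiment simulates a 5-minute train.py run, and the in-memory log stands in for the append-only results.tsv and the git commit/checkout the real agent performs.

```python
import random

def run_experiment(seed: int) -> float:
    """Stand-in for one 5-minute train.py run; returns a fake val_bpb."""
    random.seed(seed)
    return 1.0 + random.uniform(-0.05, 0.05)

def autoresearch_loop(n_experiments: int) -> list[str]:
    best_bpb = float("inf")
    log = []  # stands in for append-only results.tsv
    for i in range(n_experiments):
        val_bpb = run_experiment(i)
        keep = val_bpb < best_bpb  # keep only strict improvements
        if keep:
            best_bpb = val_bpb     # real loop: git commit
        # else: real loop reverts train.py (git checkout)
        log.append(f"exp{i}\t{val_bpb:.4f}\t{'keep' if keep else 'revert'}")
    return log

log = autoresearch_loop(10)
```

The real agent additionally records commit hash, VRAM usage, and a free-text description per row, and enforces the 5-minute budget externally.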

Setup

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run prepare.py  # one-time data prep (~2 min)
uv run train.py    # verify single experiment works (~5 min)
```

Then point Claude Code or Codex at the repo with program.md as context and let it run overnight with --dangerously-skip-permissions.

Key design decisions

  • val_bpb is vocabulary-size-independent — agent can change tokenizer and still compare fairly.
  • 5-minute window makes results platform-specific but internally consistent.
  • Simplicity rule: prefer simpler solutions; complexity only justified by substantial improvement.
  • Kill runs over 10 minutes; never pause to ask permission.
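The vocabulary-independence claim follows from how bits-per-byte is computed: per-token cross-entropy is converted to bits and normalized by raw bytes rather than tokens, so a tokenizer change moves token count and per-token loss together while the byte count stays fixed. A minimal sketch of the standard conversion (the helper name is mine, not nanochat's):

```python
import math

def val_bpb(loss_nats_per_token: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token * total_tokens / total_bytes

# e.g. a tokenizer averaging 4 bytes/token with loss 4*ln(2) nats/token scores 1.0 bpb
```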

Sources: GitHub repo, program.md.

2. Nanochat — the training foundation

Nanochat — "the best ChatGPT that $100 can buy." Autoresearch is a stripped-down single-GPU fork.

Key specs

  • Optimizers: Muon + AdamW.
  • Tokenization: BPE.
  • Depth parameter: --depth N controls model size. d12 = GPT-1 (12 layers, ~6 min). d26 = GPT-2 capability.
  • Default batch size: 32 (needs 80GB VRAM). Reduce --device_batch_size to 16/8/4/2/1 for smaller GPUs.
  • Benchmark (March 2026): GPT-2 capability in 2 hours on 8×H100, ~$48. Uses NVIDIA ClimbMix dataset and fp8 training.

Consumer GPU feasibility

| GPU | VRAM | Feasibility | Batch size |
|-----|------|-------------|------------|
| H100 | 80GB | Default target | 32 |
| RTX 4090 | 24GB | Works with reduced batch | 8–16 |
| RTX 3080 | 10GB | Confirmed working | 2–4 |
| GTX 1080 | 8GB | Should work | 1–2 |
| GTX 1660 Ti | 6GB | Tight, depth 12 likely max | 1 |

At depth 12 with batch size 1–2, the 5-minute window produces ~10M-parameter models — too small for emergent behaviors, but valid for hyperparameter and optimizer research.
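The ~10M figure is easy to sanity-check with a back-of-envelope GPT-style parameter count; the dimensions below are illustrative assumptions, not nanochat's actual config:

```python
def approx_params(depth: int, d_model: int, vocab: int) -> int:
    """Rough GPT-style count: ~4*d^2 (attention) + ~8*d^2 (MLP) per layer,
    plus the token-embedding table; biases and norms ignored."""
    per_layer = 4 * d_model**2 + 8 * d_model**2
    return depth * per_layer + vocab * d_model

# depth 12 with a few-hundred-wide model lands in the low tens of millions
n = approx_params(12, 192, 32768)
```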

3. Multi-agent collaboration — what Karpathy tried

8-agent experiment (February 27, 2026)

8 agents simultaneously — 4 Claude, 4 Codex — each with 1 GPU, all working on the same problem: remove the logit softcap from nanochat without regression.

Organizational structures tested:

  • 8 independent solo researchers
  • 1 chief scientist directing 8 junior researchers

Infrastructure (lightweight, no Docker / VMs):

  • Git branches per research program, feature branches per agent
  • Git worktrees for filesystem isolation
  • tmux grid sessions for monitoring
  • File-based inter-agent communication

Result: "It doesn't work and it's a mess... but it's still very pretty to look at."

Parallelism worked perfectly. The bottleneck was scientific judgment — agents are "very good at implementing any given well-scoped and described idea but they don't creatively generate them." They didn't design experiments carefully, ran nonsensical variations, skipped strong baselines, and didn't control for compute.

SETI@home vision (March 8, 2026)

"The goal is not to emulate a single PhD student, it's to emulate a research community of them."

Autoresearch as a "seed" — agents on different GPUs / platforms contribute commits in different research directions. GitHub is "almost but not really suited" because of single-master assumption. Proposed agents reading GitHub Discussions / PRs for inspiration, contributing "papers" of findings back.

"Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks."

4. Tools and infrastructure for agent loops

Claude Code

  • --dangerously-skip-permissions — fully unattended execution mode.
  • Git checkpoints — run git add -A && git commit before every session.
  • Safety — run in container without internet access.

continuous-claude

continuous-claude automates the loop: branch → Claude Code → commit → PR → CI check → merge/discard → repeat. Config: --max-runs, --max-cost (USD), --max-duration. Uses SHARED_TASK_NOTES.md as persistent memory. Supports parallel execution via git worktrees.

OpenAI Codex CLI

Codex CLI — terminal agent in Rust. AGENTS.md for instructions. 1M+ weekly developers. Experimental multi-agent collaboration.

Karpathy's stack

tmux grid sessions, git worktrees for isolation, file-based comms, no Docker / VMs — lightweight, direct GPU access.

5. Replicating on my server (4 GPUs)

My hardware

| GPU | VRAM | PCIe | Best for |
|-----|------|------|----------|
| GPU 0: GTX 1080 | 8 GB | Gen2 x4 | Largest models (8B Q4), autoresearch training |
| GPU 1: GTX 1660 Ti | 6 GB | Gen2 x1 | Frigate NVR (current) + small inference |
| GPU 2: GTX 1660 Ti | 6 GB | Gen2 x1 | Dedicated inference agent |
| GPU 3: GTX 1660 Ti | 6 GB | Gen2 x1 | Dedicated inference agent |

Option A — autoresearch training (Karpathy-style)

Run on GPU 0 (GTX 1080, 8 GB):

```bash
git clone https://github.com/karpathy/autoresearch
cd autoresearch && uv sync && uv run prepare.py
# Set device_batch_size=1 or 2, depth=12 in train.py
CUDA_VISIBLE_DEVICES=0 claude --dangerously-skip-permissions \
  -p "Read program.md and begin autonomous research. Use CUDA_VISIBLE_DEVICES=0."
```

Expected: depth 12 models (~10M params), batch size 1–2, ~12 experiments/hour.

Option B — multi-agent inference

Per-GPU Ollama instances via Docker Compose:

```yaml
services:
  ollama-gpu0:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    ports: ["11434:11434"]
    volumes: [ollama-gpu0:/root/.ollama]

  ollama-gpu2:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2']
              capabilities: [gpu]
    ports: ["11435:11434"]
    volumes: [ollama-gpu2:/root/.ollama]

  ollama-gpu3:
    image: ollama/ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['3']
              capabilities: [gpu]
    ports: ["11436:11434"]
    volumes: [ollama-gpu3:/root/.ollama]

# Named volumes must be declared at the top level or compose will reject the file.
volumes:
  ollama-gpu0:
  ollama-gpu2:
  ollama-gpu3:
```

| GPU | Port | Model | Use case | Perf |
|-----|------|-------|----------|------|
| GPU 0 (1080, 8GB) | 11434 | Llama 3.1 8B Q4 | General reasoning | ~25–35 tok/s |
| GPU 1 (1660 Ti) | — | Frigate NVR | Detection | — |
| GPU 2 (1660 Ti, 6GB) | 11435 | Qwen 2.5 3B / Mistral 7B Q4 | Code agent | ~26–40 tok/s |
| GPU 3 (1660 Ti, 6GB) | 11436 | DeepSeek-R1 1.5B / LLaMA 3.2 3B | Fast tasks | ~30–38 tok/s |
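A small client can then route tasks to the per-GPU instances by port. The task-kind mapping and helper names are my own sketch; the /api/generate call with stream set to false is Ollama's standard non-streaming endpoint.

```python
import json
import urllib.request

# Hypothetical routing table over the per-GPU Ollama instances above
ENDPOINTS = {
    "reasoning": "http://localhost:11434",  # GPU 0: Llama 3.1 8B Q4
    "code":      "http://localhost:11435",  # GPU 2: Qwen 2.5 3B
    "fast":      "http://localhost:11436",  # GPU 3: DeepSeek-R1 1.5B
}

def pick_endpoint(task_kind: str) -> str:
    """Route a task kind to its GPU's endpoint; default to the fast lane."""
    return ENDPOINTS.get(task_kind, ENDPOINTS["fast"])

def generate(task_kind: str, model: str, prompt: str) -> str:
    """Call Ollama's non-streaming /api/generate on the chosen instance."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        pick_endpoint(task_kind) + "/api/generate",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```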

This is the build that powers the Autonomous Agent Arena bots running 24/7 on arenabot.io.

6. Similar projects

| Project | What it does | GPU needs | Best for |
|---------|--------------|-----------|----------|
| Autoresearch | Autonomous LLM training experiments | 1 GPU/agent | ML research |
| continuous-claude | Claude Code autonomous PR loop | None | Software dev |
| LocalAGI | Self-hosted agent platform | CPU or GPU | General agents |
| Codex CLI | Terminal coding agent | None | Code tasks |
| Ollama | Local model serving | 1 GPU/model | Inference API |
| vLLM | High-throughput inference | 1+ GPUs | Production |

7. Security considerations

Karpathy's warning (Feb 20, 2026)

"I'm definitely a bit sus'd to run OpenClaw specifically — giving my private data/keys to 400K lines of vibe coded monster that is being actively attacked at scale is not very appealing at all."

The OpenClaw crisis

  • CVE-2026-25253 (CVSS 8.8): 1-click RCE via malicious link, even on localhost.
  • 800+ malicious skills in ClawHub registry (~20% of all skills) — "ClawHavoc" campaign delivered AMOS malware.
  • 42,665 exposed instances, 93.4% with auth bypass.
  • Crypto-wallet draining via RCE + filesystem access.

Security checklist for an autonomous-agent server

  1. Never expose agent UIs to public internet — use Tailscale.
  2. Run agents in Docker containers with limited filesystem access.
  3. Don't install untrusted skills/plugins from any registry.
  4. Git checkpoint before every autonomous-agent run.
  5. Separate agent network from personal data (Frigate cameras, SSH keys).
  6. Monitor GPU usage for crypto mining.
  7. Pin model versions — don't auto-pull.
  8. Use CUDA_VISIBLE_DEVICES to isolate GPUs per container.
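Items 2 and 8 combine naturally: launch each agent as its own process with its own CUDA_VISIBLE_DEVICES, so it can only see one card. A sketch (the echo command stands in for the real agent command, e.g. uv run train.py):

```python
import os
import subprocess
import sys

def launch_agent(gpu_id: int, cmd: list[str]) -> subprocess.CompletedProcess:
    """Run an agent command with visibility restricted to a single GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # the process sees one GPU as device 0
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Stand-in command that just echoes the restricted environment back
result = launch_agent(2, [sys.executable, "-c",
                          "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"])
print(result.stdout.strip())
```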

8. Experiment results — what the agent actually discovered

Session #43: 0.9979 → 0.9697 (126 experiments, H100)

23 kept, 102 discarded, 1 crash (18% keep rate). ~10.5 hours.

Top discoveries:

  • Weight decay on embeddings — baseline had NONE. Adding 0.001 → 0.003 for value embeddings = ~0.0028 total gain.
  • Init scaling 0.68× — narrow optimum, 0.66× regresses.
  • Batch halving (524K → 262K): −0.0119 (biggest single win).
  • Embedding LR 0.9 — −0.0033 via progressive increase.
  • RoPE base 200K — −0.0012.

Notable failures: weight tying (+2.24 BPB catastrophic), parallel attention+MLP (+0.011), multi-query attention (+0.008).

9. The agentic-engineering evolution

  • Feb 2025 — Vibe Coding. Karpathy's throwaway tweet that went viral.
  • Dec 2025 — Turning Point. "Coding agents basically didn't work before December." Higher model quality + ability to stay on task.
  • Feb 2026 — Agentic Engineering. "You are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight."
  • Mar 2026 — Post-AGI Moment. Agent built a video-analysis dashboard for Karpathy's home cameras in 30 minutes autonomously: "this is what post-agi feels like :) i didn't touch anything."
  • Mar 2026 — Vibe Researching. Autoresearch is the first implementation. VibeX 2026 workshop formalizes the concept.

"You're not typing computer code into an editor like the way things were since computers were invented, that era is over."

10. Contradictions and open debates

  • Agent scientific judgment — agents can implement but can't think. Karpathy's SETI@home vision assumes they'll improve; the gap is the central unsolved problem.
  • Consumer GPU viability — at depth 12 on 6–8GB GPUs, models are ~10M params — arguably too small for meaningful LLM research. Open question: does hyperparameter research at this scale transfer?
  • Ollama vs vLLM — for single-user agents, Ollama wins on simplicity. For concurrent access, vLLM handles load better. A 4-thread Athlon might bottleneck vLLM.
  • OpenClaw vs DIY — OpenClaw is fastest path but terrible security. Ollama + Claude Code + custom scripts is slower but safer.

11. The autoresearch explosion (March 8–16, 2026)

Viral growth

Autoresearch hit 30,307 GitHub stars in one week — one of the fastest-growing repos in GitHub history. 200+ forks. The pattern resonated far beyond ML researchers.

Karpathy's later tweets

March 11 — Intelligence Brownouts:

"My autoresearch labs got wiped out in the oauth outage. Have to think through failovers. Intelligence brownouts will be interesting — the planet losing IQ points when frontier AI stutters."

The Claude Code OAuth outage on March 11 (2:44 PM UTC) broke all CLI-authenticated sessions. 10,000+ Downdetector reports. The underlying API was fine — a hardcoded 15-second OAuth timeout caused the cascading failure. Developers patched it themselves within 11 minutes by extending the timeout constant. Three major outages in six weeks.

Karpathy's implication: as more workflows depend on frontier AI running 24/7, any API outage becomes an "intelligence brownout" — the collective IQ of the planet dips when agents can't think. This is a new class of infrastructure risk.

March 11 — "We Need a Bigger IDE":

"Expectation: the age of the IDE is over. Reality: we're going to need a bigger IDE (imo). It just looks very different because humans now move upwards and program at a higher level — the basic unit of interest is not one file but one agent. It's still programming."

He wants an agent command center — maximize per monitor, toggle visibility, see idle status, pop open terminals, usage stats. tmux grids are "awesome but not enough."

Who's building the agent command center

| Tool | What it does |
|------|--------------|
| VS Code v1.109 | Added Claude agent support, unified session management for local/background/cloud agents. Microsoft is positioning VS Code as the multi-agent orchestration IDE. |
| Google Antigravity | Command center (not just IDE) — watch agents collaborate in visual workflows, real-time debugging, Agent Manager for ecosystem view, drag-and-drop workflow builder. |
| Maestro | Multi-agent shared conversations; agents see each other's responses. |
| LangSmith Studio | Agent observability with interactive graph visualization. |

Experiment results update

~700 total experiments on nanochat improved GPT-2 training speed by 11% (2.02 h → 1.80 h on 8×H100). Agent discovered ~20 transferable improvements: norm scalers, regularization gaps, attention tuning, initialization problems, optimizer configs — issues the expert developer (Karpathy himself) had overlooked.

Framework fragility noted by Latent Space: GPT-5.4 xhigh failed at sustained looping while Opus 4.6 ran 118+ experiments successfully. Model choice matters enormously for autonomous loops.

Community forks and derivatives

| Project | Domain | What it does |
|---------|--------|--------------|
| AutoVoiceEvals | Voice AI | Applies the autoresearch loop to voice-agent system prompts. Adversarial eval score as metric. 20 iterations improved scheduling agent 25% → 100% success rate. |
| autoexp | Any domain | Generalized autoresearch for any quantifiable metric: prompt tuning, RAG, compiler flags, SQL-query perf, CSS scoring, API latency. |
| pi-autoresearch | Web/Software | Extends pattern to test speed, bundle size, Lighthouse scores. |
| AutoKernel | GPU | Applies the loop to GPU-kernel optimization. |
| Autosearcher | Distributed ML | Multiple agents in parallel sharing discoveries. Rediscovered Kaiming init and RMSNorm from scratch without human guidance. |
| Hyperspace AI | Distributed ML | 35 agents on a P2P network ran 333 experiments overnight (March 8–9) completely unsupervised. CEO Varun Mathur. |
| macOS fork | Apple Silicon | PyTorch SDPA replacing FlashAttention-3 for MPS / Metal. |
| MLX port | Apple Silicon | Native Apple Silicon via MLX framework (in progress). |

Shopify adoption: Tobi Lütke ran 37 overnight experiments — a 0.8B model scored 19% higher than the previous 1.6B model. Smaller, better-optimized models beating larger manually-configured ones.

The generalized "autoresearch pattern"

The deeper insight: autoresearch isn't about ML training. It's a universal optimization loop that works whenever three conditions hold:

  1. Measurable fitness signal — conversion rate, latency, defect rate, eval score, val_bpb.
  2. Repeatable controlled experiments — fast feedback cycles (seconds to minutes).
  3. Automatic keep / discard gate — no subjective judgment needed.

The pattern: modify → run → measure → keep or revert → repeat.
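Stripped of ML specifics, the pattern is hill-climbing over any candidate representation plus any scalar fitness and an automatic gate; a generic sketch (function names and the toy usage are mine):

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def autoloop(initial: T, mutate: Callable[[T, int], T],
             fitness: Callable[[T], float], steps: int,
             minimize: bool = True) -> T:
    """modify -> run -> measure -> keep or revert -> repeat."""
    best, best_score = initial, fitness(initial)
    for i in range(steps):
        cand = mutate(best, i)          # modify
        score = fitness(cand)           # run + measure
        better = score < best_score if minimize else score > best_score
        if better:                      # keep (commit) or revert (discard)
            best, best_score = cand, score
    return best

# Toy usage: minimize (x - 3)^2 with alternating step sizes
best_x = autoloop(0.0, lambda x, i: x + (0.5 if i % 2 == 0 else -0.25),
                  lambda x: (x - 3.0) ** 2, steps=40)
```

In the autoresearch instance, mutate is the agent editing train.py, fitness is the 5-minute training run's val_bpb, and keep/revert is a git commit or checkout.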

Where it applies beyond ML:

  • Prompt / system-prompt optimization (AutoVoiceEvals)
  • RAG pipeline tuning
  • Compiler-flag optimization
  • SQL query performance
  • CSS / design scoring
  • API latency reduction
  • Writing-style refinement (treat style guide as optimizable artifact, use embedding similarity as metric)

Key difference from GEPA (Genetic-Pareto prompt evolution, ICLR 2026): GEPA optimizes prompts on frozen models / APIs. Autoresearch modifies model weights via training-code changes. Both share the evolutionary keep/discard loop.

12. The autoresearch timeline

| Date | Event |
|------|-------|
| Feb 11 | Karpathy praises DeepWiki, uses it for nanochat. |
| Feb 20 | Karpathy warns about OpenClaw security ("400K lines of vibe coded monster"). |
| Feb 27 | 8-agent experiment (4 Claude, 4 Codex) — "doesn't work but very pretty". |
| Mar 5 | Nanochat GPT-2 in 2 hours. First mention of agents iterating automatically. |
| Mar 7 | Autoresearch repo released — 630 lines, MIT license. |
| Mar 8 | SETI@home vision tweet. Session #43: 126 experiments, 0.9979 → 0.9697. |
| Mar 8–9 | Hyperspace: 35 agents, 333 experiments on a P2P network overnight. |
| Mar 11 | Claude OAuth outage — "intelligence brownouts." Agent-command-center vision. |
| Mar 11 | ~700 total experiments, 11% improvement (2.02 h → 1.80 h GPT-2 training). |
| Mar 15 | @archiexzzz launches AutoVoiceEvals — autoresearch for voice AI. |
| Mar 16 | 30,307 GitHub stars. Ecosystem: autoexp, pi-autoresearch, AutoKernel, Autosearcher. |
| Mar 24 | No Priors podcast interview — deep dive on agent workflow, autoresearch philosophy, Dobby claw, education shift. |

13. No Priors interview (March 24, 2026) — Karpathy's current thinking

YouTube

The "AI psychosis" state

Karpathy describes a perpetual state of "AI psychosis" since December 2025. He went from 80/20 to 2/98 writing code vs delegating to agents. Hasn't typed a line of code since December. The shift is so dramatic that "a normal person actually doesn't realize this happened." When agents fail, it feels like "skill issue" — your instructions were bad, not the capability. This is empowering because it means you can improve.

Autoresearch philosophy (expanded)

The core motivation wasn't just efficiency — it was removing himself as the bottleneck:

"The name of the game is to increase your leverage. I put in very few tokens once in a while and a huge amount of stuff happens on my behalf."

He was surprised it worked because he'd "done this for two decades" and considered nanochat "fairly well tuned." The overnight run still found weight-decay gaps on embeddings and suboptimal Adam betas. Key realization: hyperparameters interact jointly — tuning one changes the optimal value of others. A patient, exhaustive search finds combinations human intuition misses.

Program.md contest idea: same hardware, different program.md files — compete for best improvement. Then feed ALL results to the model to write a better program.md. Meta-optimization over the research organization itself.

"Every research organization is described by program.md — a set of markdown files that describe all the roles and how the whole thing connects."

SETI@home vision (deepened)

Expanded the March 8 tweet into a concrete design:

  • Untrusted pool of workers contributing commits — expensive to produce, cheap to verify.
  • Design looks like a blockchain: commits = blocks, experimentation = proof of work.
  • Security: arbitrary code from internet is "sketchy and dodgy" — need trusted verifiers.
  • Use case: donate compute to specific research causes (cancer, materials science) instead of donating money.
  • Reference: Periodic (Liam's company) doing autoresearch for materials science with expensive lab sensors.

"A swarm of agents on the internet could collaborate to improve LLMs and could potentially even run circles around Frontier Labs."

"Layers of an onion" — the abstraction stack

LLM (taken for granted) → Agent (taken for granted) → Claw (autonomous, persistent, looping) → Multiple claws → Instructions to them → Optimization over instructions.

"This is why it gets to the psychosis — this is infinite and everything is skill issue."

Caveat — autoresearch limits

  • Only works with objective metrics. "If you can't evaluate, you can't autoresearch it."
  • Perfect fits: CUDA kernels, training hyperparameters, anything with a loss function.
  • Fails on: nuance, intent, "softer" tasks — anything outside RL training's verifiable domains.
  • Model jaggedness: "I simultaneously feel like I'm talking to an extremely brilliant PhD student and a 10-year-old." Atoms joke unchanged in 5 years despite massive capability improvements elsewhere.

Apps → APIs — the agent-first web

Built "Dobby the elf claw" for home automation:

  • Agents discovered Sonos, lights, HVAC, security cameras on LAN.
  • Qwen vision model on security-camera feed — sends WhatsApp alerts ("FedEx truck pulled up").
  • Six different apps replaced by natural language through WhatsApp.
  • "The customer is not the human anymore. It's agents acting on behalf of humans."
  • "These apps in the app store for smart-home devices shouldn't even exist."

Digital overhang, then physical

Digital transformation comes first: "flipping bits is a million times faster than accelerating matter." Massive unhobbling of digital information processing. Physical world (robotics, labs) will lag but is a much bigger market. The digital-physical interface (sensors, actuators) is where interesting companies will emerge next.

Information markets: agents paying for real-world data on demand (photos from conflict zones for prediction markets, lab experiments for materials science).

Education paradigm shift

Micro GPT (200 lines) is the distillation of his two-decade obsession. But he stopped making video guides because:

"I'm not explaining to people anymore. I'm explaining to agents. If agents get it, they'll do the explanation."

Skills as curriculum: script the progression an agent should take a student through. "The things agents can't do is your job now."

Model speciation vs monoculture

Labs pursue monoculture (one model for everything). Karpathy expects speciation — smaller models with the cognitive core intact but specialized for domains (like lean math). The science of "manipulating brains" (fine-tuning without losing capabilities) isn't developed enough yet. Context windows are cheap to manipulate; touching weights is "a lot more tricky."

Open source and power balance

  • Closed models ~6–8 months ahead, gap converging from 18 months.
  • Linux analogy: industry NEEDS open common platform.
  • "By accident we're actually in an okay spot."
  • "Centralization has a very poor track record" — wants more labs, more people in the room.
  • Frontier intelligence for Nobel-Prize-level work; open source eats everything else.

Jevons Paradox for software

Software was scarce because expensive. Agents make it cheap. Demand goes UP (bank teller / ATM analogy). "Cautiously optimistic" for software-engineering demand near-term. But long-term: "If we're successful, we're all out of a job... we're just building automation for the board."

Why not rejoin a frontier lab

  • Financial incentives create misalignment — "you can't really be an independent agent."
  • Social pressure constrains what you can say.
  • But being outside means judgment drifts — you lose visibility into what's coming.
  • Ideal: go back and forth. "I joined, now I'm outside, maybe I'll join again."
